Could you please let me know if you think this resolves the issue? We were not 
sure why this figure assumed such importance. 
Figure 7. Correlation matrix of 30 protein spots (columns) 
with mRNA levels as measured by 200 probe-sets on 
Affymetrix HuFL chips. The correlation coefficients are 
depicted with colors, bright red being near-perfect correlation 
(r = 1) and bright green anticorrelation (r = -1). The 
figure was made using the TreeView software 
(rana.lbl.gov/EisenSoftware.htm). 
IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 



In the application of: 

Arthur B. RAITANO, et al. 

Serial No.: 09/680,728 

Filing Date: 5 October 2000 

For: NOVEL G PROTEIN-COUPLED 
RECEPTOR UP-REGULATED IN 
PROSTATE CANCER 



Examiner: Minh-Tam B. Davis 
Group Art Unit: 1642 
DECLARATION OF MARY FARIS 



I, Mary Faris, declare as follows: 

1. I am currently a Group Leader at Agensys, Inc., the assignee herein. Prior to my 
employment at Agensys, I was a Senior Scientist at Incyte Genomics. I have a Ph.D. in 
Immunology and Microbiology from Ohio State University and have held post-doctoral 
fellowships at the University of Virginia and the University of California at Los Angeles, School 
of Medicine. While at Incyte, I had considerable experience in expression analysis of cellular 
mRNA using chips with multiple probes. A copy of my curriculum vitae is attached as 
Exhibit A. 

2. I am aware that a question was raised as to the substance of Figure 7 that appeared 
in an article by Oh, J.M.C., in Proteomics (2001) 1:1303-1319. A scanned copy of this Figure, 
which is in color, is attached as Exhibit B and a copy of the article itself is attached as Exhibit C. 
3. The Oh, et al. article discusses a database for use in analysis of protein expression 
in lung cancers. I agree with the statement made in the Declaration of Pia M. Challita-Eid, that 
the Oh et al. article would not have merited publication were it not expected that there is a 
correlation between the existence of mRNA and corresponding protein. 

4. The explanation of Figure 7 in the Oh, et al. article is quite brief. It is discussed 
only at pages 1316-1317 in § 5.2. As stated in § 5.2 and as confirmed in the Figure legend, 
Figure 7 consists of 30 columns, each representing a protein spot obtained on a 2D gel. For each 
column, there are 200 entries, one per row, each representing probes for mRNA on Affymetrix' 
HuFL chips measuring correlation or anti-correlation with the respective protein. The Figure 
provides 30 discrete sets of data, one set per protein, arranged side-by-side. Thus, the Figure 
represents selected, detectable proteins on 2D gel and whether a quantitatively similar amount of 
RNA corresponding to that protein was expressed at one point in time. 

5. I cannot identify from the article which 30 proteins are represented; it appears 
from the explanation in § 5.2 that some of them may be unidentified. Thus, I believe that this 
Figure is intended to reflect the data presentation concept set forth by Oh et al. 

6. I am familiar with the Affymetrix' HuFL chips, and understand that they contain 
probes for mRNA encoding a wide variety of proteins. Thus, for each individual protein column, 
at most only some of the probes would even be expected to hybridize with mRNA that actually 
encoded the protein. 

7. As explained in the Figure 7 legend, the level of RNA to protein correlation is 
represented in color, with red being a near-perfect correlation, green being a negative correlation, 
and black is not defined. I believe from the results that Figure 7 appears to provide data for two 
protein "families." The mRNA that are near-perfectly correlated with one family are generally 
anticorrelated with the other family, as would be expected. This expression pattern correlates to 
the general quadrants in the Figure. 
8. It appears that in all cases there is some mRNA for which a high correlation is 
found. These data actually support the assertions being made in the present case concerning the 
qualitative correlation of RNA to protein: In all cases RNA exists that is highly correlated with 
the existence of the protein (some RNA is simply unrelated to this protein). Thus, each protein 
perfectly correlated with existence of relevant RNA. 

I declare that all statements made herein of my own knowledge are true and that all 
statements made on information and belief are believed to be true; and further, that these 
statements are made with the knowledge that willful, false statements and the like so made are 
punishable by fine or imprisonment or both, under Section 1001 of Title 18 of the United States 
Code and that such willful false statements may jeopardize the validity of the application or any 
patent issued thereon. 

Executed at Santa Monica, California, on April 2003. 

Mary Faris 
Enclosures: 

Exhibit A: Curriculum Vitae of Mary Faris 

Exhibit B: Figure 7 (color) from an article by Oh, J.M.C., et al. in Proteomics 
1:1303-1319 (2001). 

Exhibit C: A full copy of the Oh, et al., article. 
A database of protein expression in lung cancer 

We have developed a comprehensive approach to identifying molecular changes in 
lung cancer that includes both genomic and proteomic analyses. The related effort 
has produced a large amount of data pertaining to gene expression at the RNA and 
protein levels. As a result, we have constructed a database that contains protein 
expression data on lung cancer as well as other relevant data including DNA micro- 
array derived data. A large number of proteins that are expressed in different types 
of lung cancer have been identified and have been correlated with the expression 
measures for their corresponding genes at the RNA level. The database is intended to 
facilitate our effort at developing novel classification schemes for lung cancer and the 
identification of novel markers for early diagnosis. 



Keywords: Lung / Cancer / Database / Microarray 
1 Introduction 

There is substantial interest in implementing novel and 
comprehensive strategies for the molecular analysis of 
tumors and relevant biological fluids. We have implement- 
ed a strategy for the molecular analysis of lung cancer 
that integrates genomic analysis using genome scanning 
procedures, transcriptomic analysis using cDNA and 
oligonucleotide microarrays, and proteomic analysis. For 
the latter, we have relied to date primarily on 2-D poly- 
acrylamide gels. However the 2-D gel approach is being 
increasingly complemented with additional analyses 
using liquid based protein separations and protein micro- 
arrays. While on the one hand proteomic analysis com- 
plements genomic analysis for a global assessment of 
gene expression, on the other hand proteomic analysis 
uniquely contributes an understanding of protein post- 
translational modifications and the location of protein 
gene products in subcellular compartments. The scope 
of our overall molecular analysis study of lung cancer is 
shown in Fig. 1. Important objectives include the development 
of novel molecular classification schemes for lung 
cancer and the identification of novel markers for the early 
detection of lung cancer. 

The large body of proteomic and other data we have col- 
lected has necessitated the construction of a database in 
which basic and derived data is organized. There have 
been relevant related efforts at databasing of 2-D data 
by other groups. One such database is the 2DWG Meta-database 
of 2-D gel images, which contains 2-D derived 



Correspondence: Dr. S. Hanash, University of Michigan Medical 
Center, 1150 W. Medical Center Drive, A520 Medical Science 
Research Building I, Ann Arbor MI 48109-0656, USA 
E-mail: shanash@umich.edu 
data acquired by a combination of review of results as 
well as submissions by investigators [1]. However, to 
date there are only three entries found matching the query 
for human lung images in the 2DWG Web Gel Meta-database 
web site (http://www-lecb.ncifcrf.gov/2dwgDB). 
The database we have constructed, in its entirety, is rele- 
vant to a variety of cancers. However the focus of this 
review is the use of the database to achieve our objec- 
tives related to the molecular analysis of lung cancer 
specifically. The goal of the database is to facilitate 
planned analyses, i.e. statistical analysis, as well as 
post-planned analyses, i.e. data mining. The intent is to 
make the database queryable on a protein-by-protein 
basis as well as through sub-grouping of samples ana- 
lyzed, in a menu driven fashion. Internet and WWW tech- 
nologies are used not only to allow investigators to view 
visual and textual data together, but also to allow investi- 
gators in other locations to retrieve archival data using 
different computer systems. 



2 Laboratory information processing 
system 

A long-standing Laboratory information processing sys- 
tem (LIPS) developed by our group [2] has been adapted 
for our database. LIPS consists of multiple systems and 
processes. A variety of data is stored in a variety of formats 
with individualized programs for viewing the data. Typical 
processes using LIPS include: sample inventory; digitize 
images; detect and quantify spots; match spots and nor- 
malize spot sizes across images, choose spots for MS 
analysis, enter profiles from MS-Fit web search; transfer 
data to statistical software or spreadsheets. 

Data tend to be complex and dynamic in that their con- 
tents are ever changing as information is added, modified 
or removed. Simple or intensive analyses of 2-D patterns 
Figure 1. Methods and goals of 
lung cancer studies. 



have produced a large amount of data. Data is both textual 
(e.g., reports and numbers) and visual (e.g., 1-D and 
2-D gel images). 

Some types of data generated by LIPS include: 2-D pro- 
tein gel images (silver, modified silver, blots, labeled 
gels); genome scans; 1-D gel images; spot information- 
protein names; gene information from DNA microarrays; 
MS files and MS-Fit reports (Word documents); figures 
(Raster files on the Sun and actual photographs); data 
from protein microarrays; data from liquid chromato- 
graphy separations. 

However, as computer technology has evolved, quantum 
jumps in improvements in organizing unstructured, scien- 
tific data into a structured database have become possible. 
A major function of our database and its interfaces is to 
serve as a computer-based tool for capturing the basic 
quantitative data from 2-D gel images and derived data 
and findings derived from different studies about proteins 
detected in 2-D patterns of various tumor types [3]. As a 
result, investigators are provided with easy access to data 
as well as a means for intelligent data mining of the existing 
data. A logical view of the database schema is shown in 
Fig. 2 and a list of tables and their attributes is shown in 
Table 1. 

The following are important features of the 2-D gel related 
component of our lung protein database: 

(1) All 2-D gel images are placed in hierarchies so that: (a) 
every study image is matched to one master image, i.e. all 
lung adenocarcinoma tumor images are matched to one 
master image; (b) every master image is matched to at 
most one (higher) master image, i.e. all masters for differ- 
ent lung tumor types are matched to one tumor master. 



This allows the database to have an indexing mechanism 
that can relate a spot to any gel in the hierarchy. The data- 
base provides a capability to access the basic and 
derived data using the following types of queries: (a) given 
a spot on any gel, find all spots that are matched to it; (b) 
given a spot on any gel, find all protein identifications 
made for it, and (c) given a spot on any gel, find all find- 
ings/conclusions that are linked to it. 
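
The indexing mechanism described in (1) can be illustrated with a minimal sketch, assuming each match is stored as a row linking a spot on one image to a spot on its master image (the Match attributes listed in Table 1). The image names, spot numbers, and function names below are hypothetical and are not taken from the database itself; the sketch covers only query (a).

```python
from collections import defaultdict

# Hypothetical match rows: (master_image, master_spot, image, spot).
# Study images point at a tumor-type master; tumor-type masters point
# at one higher tumor master, as in the hierarchy described in the text.
MATCHES = [
    ("adeno_master", 12, "adeno_gel_01", 101),
    ("adeno_master", 12, "adeno_gel_02", 57),
    ("squamous_master", 8, "squamous_gel_01", 33),
    ("tumor_master", 3, "adeno_master", 12),
    ("tumor_master", 3, "squamous_master", 8),
]


def build_index(matches):
    """Return (parent, children) lookups keyed by (image, spot)."""
    parent = {}
    children = defaultdict(list)
    for master_image, master_spot, image, spot in matches:
        parent[(image, spot)] = (master_image, master_spot)
        children[(master_image, master_spot)].append((image, spot))
    return parent, children


def matched_spots(spot, parent, children):
    """Query (a): given a spot on any gel, find all spots matched to it."""
    # Climb to the top-most master spot; each image has at most one master.
    root = spot
    while root in parent:
        root = parent[root]
    # Walk back down and collect every spot in the same match group.
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node != spot:
            found.append(node)
        stack.extend(children.get(node, []))
    return sorted(found)
```

Queries (b) and (c) would follow the same pattern under these assumptions: collect the matched group first, then look up identifications or findings keyed by image name and spot number.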

(2) All samples (and thereby gels derived from them) are 
identified by a list of source characteristics in four major 
categories: experiment code; cell type code; treatment 
code; and fraction code. This allows the database to 
have an identification mechanism that can relate a gel to 
any source in the hierarchy. The database provides a cap- 
ability to find all images as follows: (a) given a category, 
find all images that have the same value of the category; 
and (b) given any combination of four categories, find all 
images that satisfy the condition. 
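
The image lookup described in (2) can be sketched as a filter over the four source categories. This is only an illustration under assumed names: the record layout, the code values, and the function find_images are hypothetical, not the database's actual interface.

```python
from dataclasses import dataclass

# The four source categories named in the text.
CATEGORIES = ("experiment_code", "cell_code", "treatment_code", "fraction_code")


@dataclass(frozen=True)
class ImageRecord:
    image_name: str
    experiment_code: str
    cell_code: str
    treatment_code: str
    fraction_code: str


# Hypothetical example records; the code values are invented for illustration.
IMAGES = [
    ImageRecord("img_001", "EXP1", "A549", "NONE", "WHOLE"),
    ImageRecord("img_002", "EXP1", "A549", "TGFB", "MEMB"),
    ImageRecord("img_003", "EXP2", "DMS79", "NONE", "WHOLE"),
]


def find_images(images, **criteria):
    """Return names of images whose categories equal every given value.

    One keyword gives query (a); any combination of the four gives query (b).
    """
    unknown = set(criteria) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [
        img.image_name
        for img in images
        if all(getattr(img, key) == value for key, value in criteria.items())
    ]
```

For example, find_images(IMAGES, cell_code="A549") selects on a single category, while find_images(IMAGES, experiment_code="EXP1", fraction_code="MEMB") combines two.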

(3) All protein spots are identified by a list of characteris- 
tics in four major attributes: protein name; pI and Mr; 
accession number; and protein sequence data. A spot 
may have several findings and there may be many kinds 
of findings derived from a particular study. If possible the 
findings are recorded in a consistent way, however this is 
not always possible due to some characteristics of such 
findings (e.g., statistical analysis matrices, MS data, and 
Affymetrix data). As the number of studies has increased, 
the amount of data produced has increased. Some of the 
data [e.g. mass spectra and Affymetrix (Santa Clara, CA, 
USA) oligonucleotide chip readouts] is very large, and fills 
up the hard disks of the computers where it is collected. 
Such data is generally saved on CD-Rs, and only the most 
recent data is kept in a computer. It is sometimes easier 
Table 1. A list of tables and their attributes in the lung protein database 

Table name | Unique identifier (primary key) | List of attribute types 
Project | Project Name | Project Type, Description 
Sub Project | Sub Project Name | Date Started, Comment 
Subject | Subject ID | Case No, Sex, Birthdate, Comment 
Tissue Sample | Tissue Sample ID | Tissue Type, Diagnosis, Date Sample Taken, Date Received, How Received, Source, Comment 
DNA Sample | DNA Sample ID | Date Produced, Concentration, Freezer Location, Comment 
Gel | Gel ID | Sample ID, Batch ID, Enzyme Combination, Electrophoresis Process, Comment 
Image | Image Name | Date Imaged, Exposure Time, Image Type, Image Location, Comment 
Spot | Image Name & Spot No | X, Y, Intensity, Spot Type 
Match | Match ID | Master Image Name, Master Spot No, Image Name, Spot No 
Experiment | Experiment Code | Description 
Cell Type | Cell Code | Description 
Treatment | Treatment Code | Description 
Fraction | Fraction Code | Description 
Protein Sample | Sample ID | Experiment Code, Cell Code, Treatment Code, Treatment Date, Fraction Code, Comment, Project Type, Gel ID, Image Name, Image Type, Researcher 
Protein | Protein Name | Image Name, Spot No 
Other Link | Protein Link ID | Protein Name, Database Name, URL 
Findings | Image Name & Spot No | Category, Designation, Finding 
Protein Identification | Image Name & Spot No | Accession No, cDNA cloning, Cell Lines, Facility, Date, Genomic Cloning, Glycosylation, Mr, pI, Phosphorylation, Phosphorylation Residues, Related Spot, Sequences, Source of Protein, Name, Structural Modification, Subcellular Localization, Tissue Distribution, Type of Membrane, Type of Sequencing 
to post individual files on the web. Individual web pages 
have been created with textual and visual data that are 
difficult to relate in a table. This allows investigators an 
opportunity to analyze 2-D gel and other images contain- 
ing spots that have not been detected or identified and to 
compare data across studies. In addition this is used to 
link our data to other biological knowledge repositories 
such as GenBank, PIR International, and SWISS-PROT. 



3 Contents of the lung cancer protein 
database 

A large number of studies involving lung cancer have been 
independently performed in the laboratory. At the protein 
level, these studies have resulted in 1349 images, over 
1000 of which are images of 2-D gels for which information 
has been recorded in the lung protein database. This number 
represents a fraction of the 30 682 2-D gels produced by 
our group for different studies, which include studies of other 
cancer types encompassing head and neck, esophagus, 
liver, colon, pancreas, ovary, breast, prostate, brain, leukemias 
and childhood tumors. A list of protein gel images 
related to lung studies is shown in Table 2. While lung adeno- 
carcinomas represent a major portion of the database, 
other lung tumor types including squamous cell carcinomas 
and small cell lung cancers are represented, as are control 
lung tissues. Other 2-D patterns were produced from 
Table 2. A high-level categorization of lung protein 2-D 
images by sample type 

Lung sample type | Number of images 
Cell Lines | 421 
Cystic Fibrosis | 44 
Tumor | 635 
Normal | 170 
Plasma | 61 
Other | 18 
Total | 1349 



studies of cell lines that have been manipulated by trans- 
fection or by treatment with specific agents, as well 
as patterns produced after different cell fractionation 
schemes. Substantial emphasis is currently being placed 
on the comprehensive profiling of lung cancer derived 
surface membrane proteins. 

Mass spectrometry and/or N-terminal sequencing of protein 
spots from 2-D gels of lung tumor samples or cell 
lines have led to the identification of a large number of 
proteins expressed in lung cancer. Also, most identifi- 
cations made for proteins from a sample type can often 
be confidently transferred to matching protein spots on 
master images from lung studies. Table 3 and Fig. 3 ex- 
hibit some of the progress we have made in identifying 
proteins in 2-D gels of lung samples. 




Figure 3. Small cell lung tumor 
master containing identified pro- 
teins. 
Spot# 


NCBI 

Accession 
Number 


GenBank 
Number 


P' 


Mr 


Official 
gene 


1496 












577 




D11 QA7 

ro i y4 / 


4.361 


30.052 


SFN 


Dl O 


1 1 ^oyo 




4.569 


29.101 


YWHAZ 


i 07Q 

i*£/y 


40/000 








YWHAH 


24 


2507178 


P16118 






PFKFB1 






Kip 001649 


6.31 


20.697 


ARF1 


11 Q 










ALB 


ouu 










ALB 








5.957 


70.244 


ALB 


207 


4502031 


NP_000680 


6.811 


56.966 


ALDH1A1 


543 


3493209 


AAC36469 


7.812 


32.379 


AKR1B10 


14 


lOUYoY 


Df|i;i 07 


5.86 


57.954 


ALPP 


OOD 










ALB 












ALB 


332 


4503571 


NP_001419 


7.742 


45.407 


EN01 


268 


8272482 


AAFZ4221 






HCR 


268 


5360901 


BAA82158 








170 


ci 74477 


NIP 006073 


5.099 


52.848 










4.796 


17.194 




AfsCi 




P04083 


6.73 


39.264 


ANXA1 


occ 


4502107 


NP_001145 


4.83 


33.326 


ANXA5 


685 






5.124 


25.4 




1278 


71967 


LNHUPS 














6.33 


18.909 


ARF1 


Oft 

yo 




171 341 OA 


5.34 


14.584 


LGALS1 




113270 


P02570 


5.29 


41.7 


ACTB 




OQ/1Q7 








SPTB 




4507729 


NP_001060 


4.75 


49.8 


TUBB 




1 1QQ5077 


AB 0382 11 








104 


4757900 


NP.004335 


3.668 


57.29 


CALR 


469 






3.442 


48.772 




36 


4929561 


AAD34041 


6.25 


49.296 




•i AQ 

i4y 




MP 001753 


7.034 


60.547 


CCT6A 


l OOO 


4509899 


NP_001824 






CLTA 


789 










COL15A1 


85 


1362772 


E57233 








856 






5.415 


11.858 


CRABP2 


855 


4506451 


NPJX)2890 


4.667 


10.297 


RBP1 


439 


180570 


AAC31758 


5.34 


42.618 


CKB 


872 






4.568 


9.2 


KRT8 


321 


1673575 


U76549 







ID 

Source 



Name 



L95 



LM 
LM 
L95 

DMS79 



LM 
LM 
LM 

SKMES 



LM 
L95 
L95 

LM 

LM 

A549 

LM 

LM 

L95 

LM 

SKMES 
LM 

DMS 79 
LM 

DMS 79 

LM 
LM 

A549 
A549 
L95 

DMS 79 
LM 

LM 



LM 

A549 



(spot 1496L) possibly 
pacreatitis-associated 
protein 

14_3_3_sigma 

14_3_3_2etaDelta 

14-3-3n 

6PF-2-K/FRU-2.6-P2ASE 

Liver isorymer 
ADP-ribosylation factor 1 
Albumin 
Albumin 
Albumin 

Aldehyde Dehydrogenase 
AldoKeto Reductase 
Alkaline Phosphatase, 

Placental type 1 precursor 
Albumin 
Albumin 
a-Enolase 
a-helical protein 
a-helix coiled-coil rod 
homolugue 
a Tublin 
Amyloid B4A 
Annexinl 
Annexin V 
ApoAl 

Apoprotein, pulmonary 

surfactant 
ARF1 

p-Galactoside soluble lectin 

p-Actin 

p-spectrin 

P Tubulin 

Calmodulin dependant 
phosphodiesterase 

Calreticulin 

Calreticulin32 

CGM6 protein 

Chaperonin-like protein 

Ciathrin light chain A 

Collagen, type XV, a 1 

Complexin II 

Cellular retinotc acid- 
binding protein 2 

Cellular retinol-binding 
protein 1,CRPB1 

Creatine kinase, brain 

Cytochrome C bxydase VA 

Cytokeratin^ 
ID Source | Name | Spot# | NCBI Accession Number | GenBank Number | pI | Mr | Official gene 










A549 


Cytokeratin 8 


446 


2506774 


P05787 


C CO 


53.674 




A549 


Cytokeratin 8 


439 


2506774 


P05787 


5.52 


CO C7/( 

53.0 A 4 


KRT8 


LM 


Cytokeratin 15, keratin 15 


289 






*+. 1 JO 


49.261 


KRT15 


A549 


Dihydrolipoamide dehydro- 
genase, mitochondrial 


759 


118674 


P09622 










precursor 










o-i m r 


n M 


LM 


DJ1 


811 


6005749 


MO rtAO-IGO 

ISIr__0Uy pyo 


C AA 


LM 


DJ1.MER5 


700 






6.263 


24.001 




DMS 79 


dj475N16.1(CTG4A) 


57 


6969163 


CAB75301 




20.136 




LM 


DUTPhase 


769 






5.719 


MIP9 

nlrt 


L95 


E2 ubiquitin-conjugating 


1445 


4885417 


AB022435 








enzyme 










22.961 




LM 


EIF4d 


718 






5.104 




LM 


EIF5A 


839 






4.599 


10.957 


ERH 




Enhancer of rudimentary 


902 












(Drosophila) homolog 










47.286 


EN02 




" Enolase 2 (y, neuronal) 


295 


119347 


P09104 


4.94 


LM 


ENPL.HSP100 


18 








78.717 


ATP5JD 


A549 


F1 FO-type ATP synthase 


1519 


5453559 


NP_0063475 


5.21 


18.491 




subunit d 












CCNE1 


DMS 79 


G1/S specific cyclin E1 


31 


3041657 


P24864 




31.772 


LM 


G3PD 


540 






7.457 


ACTG1 


LM 


y-Actin 


348 


113278 


P02571 


5.146 


42.315 


LM 


Glyoxaiasel 


650 


417246 


Q04760 


4.833 


25.572 




FM0 79 


Granulocyte-macrophage 
colony-stimulating factor 


86 


117561 


PU414 1 










precursor 










73.124 




LM 


GRP75 


87 






5.9341 




LM 


GRP78 


79 






5.187 


68.109 


Oo I r 1 


LM 


GSTpi 


690 


726098 


AAC13869 


5.5 


nc A 


Heat shock 27 kD protein 1 


626 


123571 


P04792 


7.8*3 


OO 0.07 


HQPR1 




Heat shock 27 kD protein 1 


631 


123571 


P04792 


7.83 


OO 007 


UQDR1 

norDl 


A549 


Heterogeneous nuclear 


457 


5031753 


NP_00551 1 






HNRPH1 




ribonucleo protein H 










^fi 558 




A549 


HLA-B71 orHLB-B71 


818 


51 1776 










variant 










72.429 




LM 


HSC70_HSP73 


120 






5.893 




LM 


HSP90 


46 








76.096 




L95 


HSPC089 


1036 


6841 118 


A ACOQOH O 








L95 


HSPC321 


1547 


6841292 


MArtoyyy 








L95 


HSPC321 


1548 


6841292 


AAF28999 






HSPD1 


A549 


HuCha60SP60 


181 


4504521 


NP_002147 


5.7 


61 


L95 


Huntingtin associated protein 


1595 


1708113 


P54255 






HAP1 


L95 


Huntingtin associated protein 


1548 


1708113 


P54255 




54.908 


HAP1 




Intemexin neuronal intermediate 183 


6225015 


Q16352 


5.48 


INA 




filament protein, alpha 










48.106 


KRT17 




Keratin 17 


934 


4557701 


NP.00413 


4.97 


DMS 79 


KIAA1 610 protein 


26 


10047295 


AB046830 




44.03 




LM 


LamR 


340 






4.549 
Table 3. Continued 

ID Source | Name | Spot# | NCBI Accession Number | GenBank Number | pI | Mr | Official gene 












Lectin, galactoside-binding, 


Of o 


007Q0n 


171 341 OA 


5.34 


14.584 


LGALS1 




soluble, 1 (galectin 1) 














LM 


Lipocortin 






rvHUOO 


6.73 


39.264 


ANXA1 


A549 


L-Lactate Dehydrogenase 


906 


126041 


P07195 






LDHB 




H chain 














A549 


L-lactate dehydrogenase 
H chain (LDH-B) 


906 


4557032 


kid nnoooi 
Nr_UU22y J 






LM 


LaminB 


924 






c; 7A7 
0. 1 Of 


RQ fiPS 






Lymphocyte cytosolic 


924 


4504965 


NP_002289 


5.20 


70.290 




protein 1 (L-plastin) 












roMAo 


L95 


Macropain subunit zeta 


1338 


4506187 


NP_002289 






DMS79 


MHC class 1 
histocompatability 


33 


1236790 


U06487 










antigen protein 










30.239 




DMS79 


Multicalaytic endopeptidase 
comples chain C2, 


74 


346314 


JC1445 


b.o 1 






— long splice from 










15.172 




L M 


rviyosinLiyiitocii m w 


815 






4.11 




A549 


Nm23, (NlJri\A 




127981 


P15531 


5.809 


19.216 






Non metastatic cells 1 , 


7Q^ 


4CC77Q7 


NP 000260 


5,83 


17.148 


NME1 




proxein ^iNivitOMj 










17.164, 


LAP18 


1 M 
L M 


Hn 1 R Ipukpmia-associated 


809 


5031851 


NP_005554 


5.783 




phosphoprotein p18 (stahmin) 










13.655 


LAP18 


LM 


Op 18a 


807 


5031851 


NP.005554 


4.962 


LM 


Op 18m 


808 


5031851 


NP_005554 


5.302 


14.857 


LAP18 


LM 


Phosphoglycerate MutB 


639 






7.083 


27.227 
56.5 




LM 


PhospholipaseC 


OA O 






5.7 




LM 


PIMT 


eco 
oo2 






6.21 1 


25.804 




L95 


Pinch-2 protein 


1695 


9800509 


AAF99328 








L95 


Pinch-2 protein 


1825 


9800509 


AAF99328 








L95 


Possibly acttvin type II 
receptor precursor; 
DNA polymerase epsilon 
subunit B; or ITF-i DNA 
binding protein 


627 












L95 


Possibly BTF2p44 


1496 












A549 


Possibly carbonci anhydrase III 
or UCH-L1;PGP9.5 


1242 












A549 


Possibly 5-3,5 5-2,4- 
Dienoly-CoA isomerase 
precusor 


2138 












A427 


Possibly G1 to S phase 
transition Drotein; serine- 
theonine phosphatease 
protein; or phosphatase 
5 protein 


321 












L95 


Possibly GCF2 fusion 
protein or Bamacan 
homolog 


320 












L95 


Possibly glycosyltransferase 


1519 












L95 


Possibly HLA DQ 


1271 
ID Source | Name | Spot# | NCBI Accession Number | GenBank Number | pI | Mr | Official gene 
A549 | Possibly hydroxyacylglutathione hydrolase or B-lymphocyte antigen CD20 | 1080 
L95 | Possibly microtubule-based motor protein | 1438 
L95 | Possibly putative novel protein similar to HPS | 1427 
L95 | Possibly Spi-B; unnamed protein product (AK001844); or protein kinase (y15801) | 1187 
L95 | Possibly T-complex protein | 630 
A549 | Possibly U1 small nuclear ribonuclear protein A | 1148 



L95 


Possibly unnamed protein 
product (AK000369) or 
syntaxin 


1064 












L95 


— Possibly unnamed protein 
product orPro0282p 
protein 


1351 














procollagen-proline, 


110 


2507460 


P07237 


4.76 


57.116 


P4HB 




2-oxoglutarate 4-dioxgenase 
















(proline 4-hydroxylase), beta 
















polypeptide (protein disulfide 
















isomerase; thyroid homone 
















bindung protein p55) 
















proliferating cell nuclear 


515 


129697 


P1 7070 


4.4 


07 C 


DOM A 




ant in An 












PPP2R1B 




Protein phosphatase 2 


104 


5915686 


P30154 


4.84 


66.202 




(formerly 2A), regulatory 
















subunit A (PR 65), p-isoform 














LM 


Protein H precursor 


40 






3.714 


62.182 




LM 


Protein kinase C inhibitor 1 


882 


4885413 


NP_005331 


7.714 


11.521 


HINT 


L95 


Pulmonary surfactant 


1278 


190565 


AAA36510 






SFTPA1 




apoprotein precusor 












SFTPA1 


L95 


Pulmonary surfactant- 
associated protein 


1278 


131412 


P07714 






LM 


R33729J 


848 


3355455 


AAC27824 


7.508 


13.163 






Retinol-binding protein 1 , 


855 


4506451 


NP.002890 


4.99 


15.850 


RBP1 




cellular 














LM 


RoSS_A_Antigen 


69 






3.215 


47.903 


S100A11 




S100 calcium-binding 


906 












protein A1 1 (calgizzarin) 
















S100 calcium-binding 


910 


115442 


PO5109 


6.51 


10.834 


S100A8 




protein A8 (calgranulin A) 












S100A9 




S100 calcium-binding 


931 


6094219 


P50117 


6.37 


13.291 




protein A9 {calgranulin B) 










66.202 


PPP2R1B 


DMS 79 


Serine/threonine protein 
phosphatase 2A, 65kDa 
regulatory Subunit A, 
p isoform 


14 


5915686 


P30154 


4.84 




SET translocation (myeloid 


376 


1711383 


Q01105 


4.12 


32.103 


SET 




leukemia-associated) 
Table 3. Continued 

ID Source | Name | Spot# | NCBI Accession Number | GenBank Number | pI | Mr | Official gene 




* 

Small glutamine-ncn 


** / O 


O I04D00 


043765 


4.81 


34.063 


SGT 




tetraricopeptide repeat 
















fTPR\-frintaininn 
\ 1 r ry-UUt 1 lall HI ly 
















o train in 


^77 


398953 


P31947 


4.68 


27.774 


SFN 


L M 


Ci maroviHoHicm C*Al7f\ 
OUpclUAlUCUIol 1 1 wULI 1 


792 


1 3461 1 


P00441 


5.6 


17.3 


SOD1 


L Ivl 


oupciUAiue isioi iiiviin, 
«?nnproxide dismutase 2 
mitochondrial 


737 


134665 


P04179 


7.887 


20.78 


SOD2 


1 M 


TCP 1 ft ^ubunit 


202 






5.89 


59.841 




L M 


tumor protein 1) 


680 


4507669 


NPJ303286 


4.688 


25.143 


TPT1 


L M 


Thioredoxin 


896 






4,689 


9.207 




L M 


Tplastin HSP 70 


125 






5.862 


68.909 




L M 


Transthyretin 


842 






5.693 


14.714 






Triosephosphate isomerase 


672 


136060 


P00938 


7.2 


25.5 


TPI1 


L95 


Tropomyosin, cytoskeletal 
type, tropomyosin 5 


550 


136096 


P12324 


4.5 


31.9 






Tropomyosin 4 


548 


13274400 


AAK17926 


4.377 


32.733 


TPM4 


L95 


Troponin T 


866 


408217 


AAR27731 








L95 


Troponin T 


778 


408217 


AAB27731 










Tublin, p polypeptide 


229 


*tOU/ I l9 


nip nmofio 


4.78 


49.907 


TUBB 


DMS79 


Tumor associated hydroquinone 34 


A A H C7 

6o441o7 


A COn"7Q£M 

Ar^U/oo 1 










(NADH) oxidase tNOS 
















tyrosine 3-monooxygenase/ 


576 


1 1 DO 1 57Q 


P4266 


4.63 


29.174 


YWHAE 




tryptophan 5-monooxyge- 
















nase activation protein, 
















epsilon polypeptide 
















tyrosine 3-monooxygenase/ 


615 


112695 


P29312 


4.73 


26,645 


YWHAZ 




tryptophan 5-monooxy- 
















genase activation protein, 
















zeta polypeptide 
















Tyrosine 3-monooxygenase/ 


579 


112690 


P27348 


A CO 
4.DO 


./C>4 


NAA/U AO 
TWHAU 




tryptophan 5-monooxy- 
















genase activation protein, 
















theta polypeptide 












UCHL1 


A549 


Ubiquitin carboxyl-terminal 
esterase L1 (ubiquitin 
thiolesterase),UCH-L1; 
PGP 9.5, GSTmu 


656 


1 OOOO 1 


POQQ3R 


5.283 


27.745 


L95 


Unnamed protein product 


1270 


7023092 


BAA91833 








A549 


Urokinase plasminogen 
activator 


842 


487123 


S39495 


6.01 


31.263 




LM 


Vid1 


293 






4.712 


47.485 




LM 


Vid2 


294 






4.614 


46.369 




LM 


Vid4 
Vimentin 


337 
294 






4.464 


45.322 


VIM 


A427 


Vimentin 


606 


4507894 


NM.003380 






VIM 


A549 


Vimentin 


505 


418249 


PO8670 






VIM 


A549 


Vimentin 


47 


340234 


M25246 






VIM 
In addition to 2-D gel analysis, most lung adenocarcinomas
are examined at the genomic level using restriction
landmark genome scanning, and by mutation analysis
for a small number of genes. Transcriptomic analysis is
done primarily using oligonucleotide microarrays, as part
of our efforts to derive a molecular based classification of
lung adenocarcinomas that is more predictive of clinical
behavior for this group of tumors than current classification
schemes. We also have similar molecular analyses
of control lung tissue obtained from multiple sources
including adjacent lung tissue from lung cancer patients
as well as tissues obtained from non-cancer resected lung.

Only a fraction of the information in the 2-D patterns has
been linked across all studies and analyses. The lung protein
database contains the basic descriptive data of various
samples analyzed, the images of the 2-D patterns
that resulted from these samples, the quantitative spot
data and information about which spots have been
matched to each other, and conclusions or findings about
spots. The database is intended to allow not only the
retrieval of existing data, but also the mining of new
information and knowledge about protein expression in lung cells.
Data mining activities consist, for example, of reviewing
previous studies and finding out which 2-D gel patterns
and protein spots are interesting for post-planned analysis
and new discoveries. Such discoveries derive from:
(1) identification of proteins that exhibit interesting expression
profiles in 2-D patterns that have been regrouped
from different experiments and studies; (2) expanded
statistical analyses that cover protein expression patterns
involving large numbers of experiments and images;
(3) relating our data involving proteins to outside information;
and (4) relating proteomic data to genomic data.



4 Use of the database for post planned 
analysis 

4.1 Virtual matching 

Interactive software packages are used to automatically 
detect and quantify spots and to match spots between 
different protein patterns, with visual editing to correct 
any errors in computer based matching. The spot match 
program has created indices that allow investigators to 
quickly navigate through many gels and easily compare 
spots on images from many different experiments and 
studies, discover proteins of interest, and access and 
view relevant data. Here the term "match" is used as a 
logical "transitive" relation, which means if spot A is 
matched to spot B and spot B is matched to spot C 
then the spots A and C are considered matched. The 
lung protein database contains data on proteins detected
on various 2-D gels. Since all gels derived from whole
cell or tissue lysates in the lung protein database are 
tied into a single hierarchy, protein identification data 
recorded for a spot is used to derive protein data for its 
matched spots using an advanced query capability of 
the database. This is known as "virtual matching" or "vir- 
tual protein identification", which allows investigators to 
access and view all matched images and the corres- 
ponding information from the lung protein database. 
With a click on a spot, one gets the result shown in 
Fig. 4. The virtual protein identification feature does not 
provide a 100% level of certainty of protein identification, 
but it makes possible the display of spots of interest. A 
combination of automated recognition and manual edit- 
ing generally yields an accurate record of protein infor- 
mation in the database for previously unknown proteins. 
With this approach, the lung protein database will evolve 
and mature to include all correct data for further analysis 
and data mining. 
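
The transitive "match" relation and the propagation of identifications described above can be sketched with a union-find structure. This is only an illustration under stated assumptions: the class name `SpotMatcher`, its methods, and the spot labels are hypothetical and are not taken from the lung protein database itself, which implements the idea through database queries.

```python
from collections import defaultdict


class SpotMatcher:
    """Union-find over spot identifiers; 'match' is treated as transitive."""

    def __init__(self):
        self._parent = {}

    def _find(self, spot):
        # Register unseen spots as their own group, then walk to the root.
        self._parent.setdefault(spot, spot)
        root = spot
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited spot directly at the root.
        while self._parent[spot] != root:
            self._parent[spot], spot = root, self._parent[spot]
        return root

    def match(self, spot_a, spot_b):
        """Record that two spots (possibly on different gels) are matched."""
        self._parent[self._find(spot_a)] = self._find(spot_b)

    def matched(self, spot_a, spot_b):
        """True if the two spots are linked by any chain of matches."""
        return self._find(spot_a) == self._find(spot_b)

    def virtual_identifications(self, identified):
        """Propagate known identifications to all matched spots.

        `identified` maps a spot to a protein name. The result maps each
        spot without its own identification to the set of names inherited
        from the spots it is matched to.
        """
        names_by_group = defaultdict(set)
        for spot, name in identified.items():
            names_by_group[self._find(spot)].add(name)
        inferred = {}
        for spot in list(self._parent):
            if spot in identified:
                continue
            names = names_by_group.get(self._find(spot))
            if names:
                inferred[spot] = set(names)
        return inferred
```

Returning a set of names per spot, rather than a single name, reflects the caveat above that virtual identification is not 100% certain: conflicting identifications within one matched group stay visible for manual editing.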

4.2 Integrating protein spot data with MS data 

As interest in proteomic analysis grows, a number of very
large public databases have become available for accessing
protein data via the internet. Public databases offer
sophisticated text and keyword searches, which link any
entered keyword to all protein information associated with
that keyword, to ensure easy access to all relevant data.
Protein identification using MALDI-MS relies on database
searches and usually has three components: (1) peak
detection, which allows automatic determination of peptide
masses; (2) a search in protein sequence databases
(SWISS-PROT and/or GenBank) for protein entries that
match the masses; and (3) a certainty calculation, which
determines the quality of the match for each protein in
the list [4]. An example of such a software tool is PepFrag,
which searches protein and DNA sequence databases
and can use different types of mass spectrometric
information [5]. Fenyo [6] described methods and software
tools in proteomics for identifying and characterizing
proteins, with an emphasis on MS combined with database
searching. Proteolytic peptide mapping and genome
database searching provide an automated means for
identifying proteins, and the certainty of the results is
computed from the number of masses matched for each
protein [7]. Another useful tool is FindMod
(http://www.expasy.ch/sprot/findmod.html) for the systematic
characterisation of proteins using mass spectrometry [8].
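
Steps (2) and (3) above can be illustrated with a minimal sketch that counts, for each candidate protein, how many observed peptide masses fall within a mass tolerance of a theoretical peptide mass. This is an assumption-laden simplification: the function names, the 50 ppm default tolerance, and the ranking by raw match count are hypothetical and do not reproduce the scoring used by PepFrag, MOWSE, or any other tool cited here. The theoretical masses are assumed to come from an in-silico digest performed elsewhere.

```python
def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def rank_candidates(observed_masses, candidates, tolerance_ppm=50.0):
    """Rank candidate proteins by the number of observed masses they explain.

    `candidates` maps an accession to a list of theoretical peptide masses.
    Returns (accession, n_matched, fraction_matched) tuples, best first.
    """
    results = []
    for accession, theoretical_masses in candidates.items():
        matched = sum(
            1
            for obs in observed_masses
            if any(abs(ppm_error(obs, theo)) <= tolerance_ppm
                   for theo in theoretical_masses)
        )
        fraction = matched / len(observed_masses) if observed_masses else 0.0
        results.append((accession, matched, fraction))
    # Most matches first; ties are broken alphabetically by accession.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results
```

The fraction returned for each candidate corresponds loosely to a "% masses matched" figure, and `ppm_error` to a per-peptide mass deviation.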

We have created MS data forms that contain information
used in mass spectrometry queries, summary information
(Rank, MOWSE score, % Masses Matched, MW, pI, Species,
Accession #, Protein Name) and additional information
(Summary ID, Submitted Mass, Matched Mass, Delta
PPM, Start, End, Peptide Seq, Modifications, Unmatched
Masses). An example of the MS data form is shown in
Fig. 5. Integrating the lung protein database with MS
data provides a record of protein identification and high
level of integration with other public databases, although
substantial effort is required for data collection. We are
currently evaluating an automated or semi-automated
method of pulling these data when new information,
which is relevant to our objectives, is available.

Figure 4. Virtual protein identification by clicking a spot.



4.3 Integrating protein data with microarray data 

As technology evolves, new computer aids and methods 
are introduced for genomic analysis as well as proteo- 
mic analysis. With respect to DNA microarray platforms, 
a current goal is to construct lung specific cDNA micro- 
arrays for lung cancer investigations. In the meantime 
RNA expression data for lung cancer is being collect- 
ed using an Affymetrix oligonucleotide based system. 
This system automates the identification and quantifica- 
tion of microarray spots. Data files contain integrated 
intensities for each spot and ratios showing fold changes 
per spot. The use of oligonucleotide based microarrays
for RNA analysis in lung cancer by our group has resulted
in a massive amount of data. Integration of protein information
in the lung protein database with microarray data
allows us to extend data analysis capability to encompass 
RNA and protein data for a subset of genes. 
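
One simple way to picture this integration is a join of identified protein spots and microarray probe sets on a shared gene symbol. The sketch below is hypothetical: the function name, the dictionary layout, and the use of the gene symbol as the join key are assumptions made for illustration, not a description of the actual database schema.

```python
def join_by_gene(protein_spots, probe_sets):
    """Pair each identified protein spot with probe sets for the same gene.

    protein_spots: dict spot_id -> (gene_symbol, [intensity per sample])
    probe_sets:    dict probe_id -> (gene_symbol, [mRNA level per sample])
    Both value lists are assumed to use the same sample order.
    """
    probes_by_gene = {}
    for probe_id, (gene, levels) in probe_sets.items():
        probes_by_gene.setdefault(gene, []).append((probe_id, levels))

    joined = []
    for spot_id, (gene, intensities) in sorted(protein_spots.items()):
        # Spots whose gene has no probe set simply produce no rows.
        for probe_id, levels in sorted(probes_by_gene.get(gene, [])):
            joined.append({
                "gene": gene,
                "spot": spot_id,
                "probe_set": probe_id,
                "protein": intensities,
                "mrna": levels,
            })
    return joined
```

Only spots with a gene symbol that also appears on the array produce rows, which is why the combined analysis covers a subset of genes.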



5 Some findings derived from the lung 
cancer protein database 

5.1 Unique proteomic pattern of small cell 
lung cancer 

A major goal of our proteomic and genomic studies of 
lung cancer is to derive novel classification schemes that 
have utility in making a diagnosis, predicting outcome and 
in making therapeutic decisions. An important first step in 
this direction is to determine the ability of proteomic pro- 
filing to distinguish between known types of lung cancer. 
Specific protein differences between different types of
cancer have been identified by other groups. In a recent
study of breast, ovary and lung tumors, 20 differentially
expressed proteins were identified [9] and in a prior study,
16 polypeptides were found to be associated with different
histopathological features of lung cancer [10, 11]. In a
study of 25 adenocarcinomas of the lung, 12 small cell



1314 



J. M. C.Oh etal. 



Proteomics 2001, 1, 1303-1319 






Figure 5. MS data form 



lung cancers, and 16 squamous cell tumors by our group
(manuscript submitted), an initial analysis of protein 2-D
patterns uncovered a group of 52 protein spots that differed
in average integrated intensity between the three
groups. Simple two-sample t-tests gave p values of less
than 0.05 for each of the 52 spots for at least one
of the pairs of groups. Most of the spots differed between
small cell and the remaining two diagnostic groups, with
47 spots differing significantly between the small cell and
adenocarcinoma groups and 44 between the small cell and
squamous groups (p<0.05). Between the adenocarcinoma and
squamous groups, 12 spots with differences of this significance
were found. Summary data for some of the spots are
presented in Table 4. The first two principal components
of the data are graphed in Figure 6, and show that as a
group the spots distinguish small cell tumors from the
other two tumor types fairly easily.
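
The pairwise comparison used above can be sketched as follows. This is a minimal illustration that assumes the pooled-variance form of Student's two-sample t-test; the text does not state which variant was applied. The function names are hypothetical, and the p value is obtained by numerical (Simpson) integration of the t density so that only the standard library is needed.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p value via Simpson integration of the density on [0, |t|].

    `steps` must be even. math.gamma overflows for very large df, which is
    not a concern for group sizes such as 12, 25 and 16.
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def pooled_t_test(a, b):
    """Student's two-sample t-test with pooled variance; returns (t, df, p).

    Assumes both groups have at least some within-group variation.
    """
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df, two_sided_p(t, df)
```

Applied to the integrated intensities of one spot in two tumor groups, `pooled_t_test` returns the kind of p value tabulated per pair of groups. Because 52 spots are each tested for three pairs, some p values below 0.05 would be expected by chance alone.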

We have identified 39 of this set of 52 spots by either 
N-terminal sequencing and/or MS of spot digests. Small 
cell lung cancers were characterized by higher average 
amounts for some proteins associated with cell prolifera- 
Table 4. 39 identified protein spots found to differ between small cell, adenocarcinoma, and squamous tumors of the lung
(n = 12, 25, 16). The t-test columns give p values from the two-sided two-sample t-test comparing each pair of
groups. Cells marked [illegible] could not be read in this copy.

Spot # | Unigene description | Official gene symbol | Mean adenocarcinoma | Mean squamous | Mean small cell | t-test adenocarcinoma vs small cell | t-test small cell vs squamous | t-test adenocarcinoma vs squamous
294 | vimentin | VIM | 1.36 | 1.16 | 0.53 | 0.010 | 0.016 | 0.509
319 | albumin | ALB | 2.13 | 1.67 | 0.73 | 0.001 | 0.005 | 0.231
666 | albumin | ALB | 0.72 | 0.59 | 0.20 | 0.002 | 0.030 | 0.461
[illegible] | albumin | ALB | 2.34 | 1.80 | 0.63 | 0.010 | 0.034 | 0.383
[illegible] | lectin, galactoside-binding, soluble, 1 (galectin 1) | LGALS1 | 1.95 | 1.69 | 0.83 | 0.000 | 0.002 | 0.310
[illegible] | ADP-ribosylation factor 1 | ARF1 | 0.22 | 0.19 | 0.06 | 0.012 | 0.046 | 0.607
522 | annexin A5 | ANXA5 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]
515 | proliferating cell nuclear antigen | PCNA | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]
577 | stratifin | SFN | 0.78 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]
626 | heat shock 27 kD protein 1 | HSPB1 | 0.87 | 1.18 | 0.30 | 0.000 | 0.002 | 0.128
631 | heat shock 27 kD protein 1 | HSPB1 | 1.04 | 1.35 | 0.46 | 0.003 | 0.017 | 0.277
793 | non-metastatic cells 1, protein (NM23A) | NME1 | 0.36 | 0.43 | 0.59 | 0.003 | 0.033 | 0.253
807 | leukemia-associated phosphoprotein p18 (stathmin) | LAP18 | 0.03 | 0.05 | 0.92 | 0.000 | 0.000 | 0.351
809 | leukemia-associated phosphoprotein p18 (stathmin) | LAP18 | 0.55 | 0.50 | 3.88 | 0.000 | 0.000 | 0.732
931 | S100 calcium-binding protein A9 (calgranulin B) | S100A9 | 0.95 | 1.18 | 0.24 | 0.026 | 0.001 | 0.447
104 | protein phosphatase 2 (formerly 2A), regulatory subunit A (PR 65), beta isoform | PPP2R1B | 0.17 | 0.13 | 0.65 | 0.000 | 0.001 | 0.188
110 | procollagen-proline, 2-oxoglutarate 4-dioxygenase (proline 4-hydroxylase), beta polypeptide (protein disulfide isomerase; thyroid hormone binding protein p55) | P4HB | 0.10 | 0.10 | 0.30 | 0.014 | 0.049 | 0.906
183 | internexin neuronal intermediate filament protein, alpha | INA | [illegible] | [illegible] | [illegible] | [illegible] | 0.000 | [illegible]
229 | tubulin, beta polypeptide | TUBB | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]
289 | keratin 15 | KRT15 | 0.35 | 0.29 | 0.65 | 0.028 | [illegible] | 0.343
295 | enolase 2, (gamma, neuronal) | ENO2 | 0.10 | 0.23 | 0.39 | 0.000 | 0.065 | 0.026
376 | SET translocation (myeloid leukemia-associated) | SET | 0.25 | 0.17 | 0.71 | 0.000 | 0.000 | 0.031
439 | creatine kinase, brain | CKB | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]
460 | annexin A1 | ANXA1 | 0.43 | 0.42 | 0.59 | 0.014 | 0.026 | 0.691
476 | small glutamine-rich tetratricopeptide repeat (TPR)-containing | SGT | 0.16 | 0.19 | 0.33 | 0.000 | 0.000 | 0.241
576 | tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, epsilon polypeptide | YWHAE | 0.40 | 0.38 | 0.82 | 0.000 | 0.001 | 0.697
579 | tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, theta polypeptide | YWHAQ | 0.52 | 0.55 | 0.91 | 0.000 | 0.006 | 0.703
615 | tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, zeta polypeptide | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]



larger amounts of several protein spots detected on these
gels that did not occur in similar gels made from cell lines
and were thought to be cleavage products from proteins
present in cells or plasma surrounding the tumor cells
(e.g. cleaved albumin). The number of protein spots that
differed between lung adenocarcinomas and squamous
tumors was smaller than the number that distinguished
small cell lung cancer from the other
two lung cancer types. ENO2 was lowest in the adenocarcinoma
group, while ANXA5 and CKB were lowest and
KRT17 and SFN highest in the squamous carcinoma samples.
Several interesting spots found in the study remain
to be definitively identified.



Figure 6. First two principal components for 52 protein
spots distinguishing between lung tumor types. Small
cell lung cancer samples are shown as squares, adenocarcinomas
as circles and squamous lung tumors as
triangles.



tion such as proliferating cell nuclear antigen (PCNA) and 
oncoprotein 18 (Op18) [12-15], particularly the once-phosphorylated
form of Op18, as well as protein products
of the UCHL1, RBP1, CRABP2, KRT15, and TUBB genes
among others. Squamous cell and adenocarcinoma samples
had greater amounts of the S100 proteins S100A8,
S100A9, and S100A11, as well as larger average amounts
of both the unphosphorylated and phosphorylated 27 kD
heat shock protein (HSPB1). These two groups also had



5.2 Correlations between RNA and protein 
expression 

The availability of mRNA expression data from microarrays
or Affymetrix chips for the same samples for which
we have protein 2-D gel data permits several additional
types of questions to be asked. We have thus far entertained
only simple models of protein/mRNA relationships
that ask which mRNA levels are most correlated with protein
spot sizes. Figure 7 depicts such a correlation matrix
using colors rather than numerical data, since this makes
it easier to visualize the relationships. In cases for which
the identity of the protein spot is known such investigations
can answer the question of how well mRNA levels
for a protein predict that protein's abundance. In cases
of protein spots that have not yet been identified, or identified
without high confidence, such correlations can lead
to or confirm hypothetical spot identifications. More generally
one can search for larger groups of proteins and
mRNA whose abundances are controlled by some common
mechanism.

Figure 7. Correlation matrix of 30 protein spots (columns)
with mRNA levels as measured by 200 probe-sets on
Affymetrix HuFL chips. The correlation coefficients are
depicted with colors, bright red being near-perfect correlation
(r = 1) and bright green anticorrelation (r = -1). The
figure was made using the TreeView software
(rana.lbl.gov/EisenSoftware.htm).
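
A correlation matrix of this kind can be sketched in a few lines. The example below is an illustration only: it assumes Pearson correlation across samples (the text does not name the coefficient), the function names are hypothetical, and the red/green mapping is a simple linear scale chosen for the sketch rather than the actual color scheme of the TreeView software.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(mrna, protein):
    """Rows are probe sets, columns are protein spots; entries are r values.

    mrna:    dict probe_id -> [mRNA level per sample]
    protein: dict spot_id  -> [spot intensity per sample], same sample order
    """
    spots = sorted(protein)
    probes = sorted(mrna)
    matrix = [[pearson(mrna[p], protein[s]) for s in spots] for p in probes]
    return probes, spots, matrix


def red_green(r):
    """Map r in [-1, 1] to an (R, G, B) tuple: red for +1, green for -1."""
    if math.isnan(r):
        return (128, 128, 128)  # gray for undefined correlations
    level = round(255 * min(1.0, abs(r)))
    return (level, 0, 0) if r > 0 else (0, level, 0)
```

For an identified spot, the entry of interest is the one pairing the spot with the probe set for its own gene; for an unidentified spot, the probe sets with the highest r values suggest candidate identities to confirm by other means.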



5.3 Identification of novel lung cancer markers 

We have utilized a proteomic approach to identify pro- 
teins that commonly induce an antibody response in lung 
cancer. Such identified proteins or their corresponding 
autoantibodies likely have substantial utility for cancer 
diagnosis. There is also evidence that autoantibodies 
may be present prior to clinical diagnosis and therefore 
detection of autoantibodies or of circulating antigens 
may have utility for screening and early diagnosis of can- 
cer. We have identified a battery of proteins that induce 
autoantibodies that are specific for different types of can- 
cer. We have identified a panel of autoantibodies that are 
detectable in serum of lung cancer patients at the time
of diagnosis. The availability of a database of protein
expression in lung cancer has facilitated the identification
of proteins that induce autoantibodies, in addition to
providing valuable information regarding the expression
pattern of such antigens in different tumor types and cell
lines. One such antigen we have identified in lung cancer
is protein PGP 9.5 (Fig. 8) (Brichory et al., manuscript
submitted) [16]. PGP 9.5 was identified as a protein in lung
cancer that induces autoantibodies as part of a study in
which sera from 64 newly diagnosed patients with lung
cancer, from 99 patients with other types of cancer and
from 71 noncancer controls were analyzed for antibody-based
reactivity against lung adenocarcinoma proteins
resolved by 2-D PAGE. Gels containing separated proteins
were blotted and subsequently hybridized with individual
sera from patients or controls. Unlike in controls,
autoantibodies against a protein identified by MS as protein
gene product 9.5 (PGP 9.5) were detected in sera from 9 out
of 64 patients with lung cancer.

Circulating PGP 9.5 antigen was detected in sera from two
additional patients with lung cancer, without detectable
PGP 9.5 autoantibodies. PGP 9.5 is a neurospecific polypeptide
previously proposed as a marker for non-small cell
lung cancer, based on its expression in tumor tissue. Using
the A549 lung adenocarcinoma cell line, we have demonstrated
that PGP 9.5 was present at the cell surface, as well as
secreted. Thus, the findings of PGP 9.5 antigen and/or antibodies
in serum of patients with lung cancer suggest that
PGP 9.5 may have utility in lung cancer screening and diagnosis,
as part of a panel of such proteins or their corresponding
antibodies, which we have identified.



6 Web pages 

The relational database for storage of sample, image,
protein information and other related data is being constructed
in a stepwise fashion. The construction of a
comprehensive database to collect all pertinent information
is rather challenging and necessitates substantial
resources. A similar effort in this area is WebGel,
a web based gel database analysis system that contains
previously quantified gel data generated from a
stand-alone quantitative gel analysis system [16]. Public
WebGel demonstration databases currently available
can be found at the web site
(http://www-lecb.ncifcrf.gov/webgel WebGel database). The task of web based
retrieval of data from the protein database is rather complex
as there are different kinds of data that may need
to be retrieved. The microarray data could be stored in
the database instead of Excel files, and the Access 2000
database that the MS team utilizes could be transferred to
the database. Tables are being built to eliminate any
handwritten collection of data. Developing a database is
hard because of the complexity and very large amount of
unstructured data generated. There are conflicting
pressures between "using what we've got already"
and constructing something better. Sometimes there is
a natural break in the data, such as when a shift is
made from one platform type to another. Then one
could "pile up" old data and organize it neatly. On the
other hand, when new technologies are introduced,
they require new ways of storing the data. The lung
protein database is continuously evolving to enhance
the relational schema to be more flexible and comprehensive
and to make data processing more robust and
automatic.

Figure 8. 2-D PAGE and Western blot analysis of A549 lung
adenocarcinoma cell proteins.
Panel 1 shows A549 2-D protein
pattern after silver staining. The
boxed area is shown in panel 2,
in which arrows point to the
location of PGP 9.5 forms
(spots P1 to P3) recognized by
sera from patients with lung
cancer and the position of the
form P4 recognized by a polyclonal
rabbit anti-PGP 9.5 antiserum,
which also recognizes
P1-P3. Panel 2 shows close-ups
of western blots hybridized
with two different sera from
patients with lung adenocarcinoma
that showed reactivity
against PGP 9.5 proteins.

The lung protein database is a backbone to record pro- 
teome data for many different studies and to mine the 
existing data for new discoveries. The new generation 
LIPS provides investigators web-enabled interfaces to 
the laboratory databases and 2-D images with internet 
access. There is certainly a need for sharing information 
in the database on a global basis. We have used internet 
and WWW technologies to provide a distributed process 
with easy-to-use front-end user interface. Figure 9 shows 
a top level view of a web-based process for performing 
our studies from a data processing perspective. Some of 
our web pages were developed in Visual InterDev and 
ASP development environment on Microsoft and some 
were developed in Oracle 8i and WebDB web application 
environment on Solaris. As an example, the MS data web 
page is shown in Fig. 10. Detailed "how-to" documentation
is provided as on-line help for recently extended
capabilities of LIPS.




Figure 9. Web-based process of using lung protein data- 
base. 



7 Conclusion 

The value of the database we have constructed depends 
to a large measure on its content, the quality of data and 
the ease with which data can be retrieved and analyzed. 
While the amount of data generated is already quite size- 
able, it is likely that the database will continue to undergo 
substantial expansion. Proteins are being identified at a
rapid pace, thus enhancing our ability to link protein
expression data with RNA based expression data for corresponding
genes. As such, the database will play an
important role in achieving our objective of developing
novel classification schemes for lung cancer and the
identification of novel markers for early diagnosis. The
database will also serve as a useful resource for other
investigations of lung biology and of diseases other than
lung cancer.